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ABSTRACT 

A task-based formulation of Scalable Universal Matrix Mul¬ 
tiplication Algorithm (SUMMA), a popular algorithm for 
matrix multiplication (MM), is applied to the multiplication 
of hierarchy-free, rank-structured matrices that appear in 
the domain of quantum chemistry (QC). The novel features 
of our formulation are; (1) concurrent scheduling of multiple 
SUMMA iterations, and (2) fine-grained task-based compo¬ 
sition. These features make it tolerant of the load imbalance 
due to the irregular matrix structure and eliminate all ar- 
tifactual sources of global synchronization. Scalability of 
iterative computation of square-root inverse of block-rank- 
sparse QC matrices is demonstrated; for full-rank (dense) 
matrices the performance of our SUMMA formulation usu¬ 
ally exceeds that of the state-of-the-art dense MM imple¬ 
mentations (ScaLAPACK and Cyclops Tensor Framework). 
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1. INTRODUCTION 

A frontier challenge posed by scientific and engineering 
applications in areas as distinct as quantum physics and 
machine learning is dealing with sparse and non-standard 
tensorial data representations. Such data appears in many 
forms: sparse tensors, multiresolution spectral element trees, 
^-matrices, tensor networks, and many others. What they 
share in common is the reduced number of parameters in 
terms of which the data is represented, at the cost of more 
complex data representation and computation relative to 
the standard/naive counterpart. The need to deal with ir¬ 
regular data representations conflicts with the evolution of 
computer hardware and programming models that demand 
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regular patterns of data access and computation for peak 
performance. This tension drives the search for algorithms 
that minimize/avoid communication and/or hide its cost, 
and are capable of dealing with increasingly irregular nu¬ 
merical data structures. 

In this work we explore parallel computation with matri¬ 
ces composed of low-rank blocks that we refer to as Clustered 
Low-Rank (CLR) matrice^ Concrete examples of such ma¬ 
trices are taken from quantum chemistry and materials sci¬ 
ence. Matrices (tensors) in such context represent quantum 
states (of electrons) and operators represented in some basis. 
Efficient application of operators to states — represented by 
matrix multiplication (tensor contraction) — demands tak¬ 
ing advantage of the matrix (tensor) structure that take the 
form of (a) block-sparsity due to the distance decay of the 
operator kernel and the localized nature of basis functions 
[20| , (b) symmetries under geometric and other transforma¬ 
tions [27[ 19 , and/or (c) block-rank-sparsity due to smooth¬ 
ness of states and operator kernels Notably, exploiting 
this matrix structure depends on problem-specific blocking 
of matrix dimensions that arise due to domain-specific needs 
and typically cannot be chosen arbitrarily. In other words, 
the matrices that we encounter are “sparse” in a general 
sense, which encompasses element-, block-, and block-rank- 
sparsity; but in a practical sense the matrices are not sparse 
enough to be a good match for the established sparse MM 
algorithms. 

The key idea is a novel task-based variant of Scalable Uni¬ 
versal Matrix Multiplication Algorithm (SUMMA) of van de 
Geijn and Watts, [3^ , a well-known algorithm for distributed- 
memory dense matrix multiplication (MM). The nonuniform 
blocking and data inhomogeneity of CLR matrices conflict 
with the uniform data distribution exploited by all distributed- 
memory, dense MM algorithms — including Cannon’s [11| , 
SUMMA [^, and others [15[ |14[ |33[ |27| . Task-based 

formulation allows us to overcome this limitation. Task- 
based/dataflow programming models are a natural choice 
for implementation of algorithms with irregular data and 
computation patterns; such models have already been used 
successfully for dense matrix algebra applications [21[ [^. 
Besides handling matrices with structure, the task-based ap¬ 
proach provides additional benefits: (a) inter-node commu- 


^ Related matrix data structures have appeared under many 
names (matrices with decay, 7^-matrices, rank-structured 
matrices, and mosaic skeleton approximation), but no sin¬ 
gle globally-accepted terminolo gy e xists. For the history of 
these types of matrices see Ref |37|. 




nication costs can be partially or fully hidden by overlapping 
computation and communication, (b) performance should be 
less sensitive to topology, latency, and CPU clock variations, 
(c) fine-grained, task-based parallelism is a proven means to 
attain high intra-node performance by leveraging massively 
multicore platforms and hiding the costs of memory hierar¬ 
chy {e.g. Intel TBB, Cilk), (d) lack of global synchronization 
allows the overlap multiple, high-level stages of computation 
{e.g. two or more multiple matrix multiplications contribut¬ 
ing to the same expression). 

The new formulation was used to implement iterative com¬ 
putation of the square root inverse of a matrix, a prototypi¬ 
cal operation in which block ranks of intermediate matrices 
change dynamically during the iteration. The usual advan¬ 
tage of the task formulation, tolerance of load imbalance and 
latency, are demonstrated in the regime where matrices ap¬ 
proach full rank, by comparison against the state-of-the-art 
dense MM implementations. 


2. STANDARD SUMMA AND VARIANTS 

Before describing our algorithm, we start out by recap¬ 
ping the standard SUMMA and its variants for distributed- 
memory MM. Although there are earlier and asymptot¬ 
ically faster algorithms, SUMMA has become popular 
due to its relative simplicity and flexibility (it can be eas¬ 
ily generalized to rectangular matrices and process grids). 
From a just as important practical standpoint, SUMMA, like 
all 2D algorithms, uses memory more economically than its 
3D counterparts [^[^. SUMMA is also a building block for 
more complicated 2.5D and 3D MM algorithms [30| and pro¬ 
duced a number of variants, 16 including sparse SUMMA 
(SpSUMMA) dl]. 

SUMMA implements matrix multiplication C = AB as 
a series of rank-fc updates. The input and output matrices 
are embedded on a rectangular process mesh in an element- 
cyclic or block-cyclic manner to ensure approximate load 
balance. In each iteration of the algorithm, a column/row 
panel of A/B is broadcast along rows/columns of the process 
grid, respectively; matrix C remains stationary through¬ 
out the procedure (variants of SUMMA in which A or B 
are stationary are also possible; transposed multi plie s, e.g. 
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C = ABT, are also relatively simple to handle 
Each pair of broadcasts is followed by a rank-fe update, 
Cij •<— AikBkj + Cij, (Einstein summation convention is used 
throughout). Original SUMMA papers by van de Geijn and 
Watts and by Agarwal et al. d considered versions of 
the algorithm that overlapped computation and communica¬ 
tion by pipelining and preemptive broadcasts, respectively, 
and other broadcast variants have been considered [^, in¬ 
cluding topology-specific broadcasts 29 . For simplicity, we 


present a SUMMA version with preemptive broadcasts in 

Figured 

DIMMA, an early variation of SUMMA introduced by 
Choi in 1998, improved performance of synchronous SUMMA 
by realizing that the order of broadcasts in the original al¬ 
gorithm coupled with communication barriers created sig¬ 
nificant slack in communication 13 . Iterations were 
reordered in DIMMA such that each node broadcasts all 
of its data in succession as opposed to the round-robin ap¬ 
proach in SUMMA. Similar improvements can be attained 
by overlapping communication and computation d. 

SUMMA was recently extended to sparse MM (SpSUMMA) 
by BuIuq and Gilbert d 1^. This algorithm is similar to 


Algorithm 1 SUMMA with non-blocking broadcast 

1: Broadcast(A(*, 0),0,roui-jrroup) 

2: Broadcast(B( 0,*),0, coLqroup) 

3: for fe = 0,..., A - 1 do 

4: if A: -I- 1 < A then 

5: rowATOot (fc -|- 1) mod cols 

6: col_root t— (fc -|- 1) mod rows 

7: Broadcast(A(*, k -|- l),row^root, row^group) 

8: BROADCAST(i3(fc -|- 1, *), coBroot, coBgroup) 

9: end if 

10: WAIT(A(*,fe)) 

11: WAIT(A(fc,*)) 

12: G(*, *) <— q:A(*, k) • B{k^ *) -t- C'(*, *) 

13: end for 


dense SUMMA, except matrix sub-blocks are stored in a 
doubly compressed sparse column (DCSC) format and sparse 
generalized matrix-multiplication (SpGEMM) is used to com¬ 
pute rank-fc updates. The main challenges of all sparse MM 
algorithms, including SpSUMMA, are the increased relative 
costs of communication compared to the dense case, load 
imbalance, and the relatively low intra-node performance of 
sparse matrix kernels. 

The problem of load imbalance does not appear in dense 
SUMMA implementations as the work load is nearly-optimally 
balanced. With random sparsity, approximate load balance 
is achieved in the asymptotic limit, however with structured 
sparsity {e.g. matrices with decay) one does not expect 
rows/columns to be uniformly filled. 

In the regime of high sparsity (low matrix fill-in factors) 
the communication time dominates the computation time 
due to the effectively-reduced benefit of blocking. The in¬ 
creasing role of communication in sparse MM can be some¬ 
what alleviated by communication hiding. BuIuq and Gilbert 
discussed potential benefits of communication hiding for spa¬ 
rse MM in Ref. but did not pursue this approach in their 
experiments due to lack of quality one-sided communication 
tools [^. Another possibility is to minimize communica¬ 
tion, e.g. by switching to 3D MM Although sparse 2D 
SUMMA is not strongly scalable, nevertheless good scala¬ 
bility of SpSUMMA was demonstrated in practice [^. 

3. TASK-BASED SUMMA FORMULATION 

Our work to improve SUMMA is motivated by the needs 
of computation on matrices/tensors with irregular low-rank 
structure of their blocks. Some of the challenges of com¬ 
puting with such data are similar to the general challenges 
of sparse MM: increased communication/computation ratio 
and lack of load balance. The latter is compounded by the 
desire to use physics-based blocking of matrix dimensions as 
well as non-standard data representations. To address these 
challenges, we set to investigate a task-based formulation of 
SUMMA algorithm designed to partially offset some commu¬ 
nication latency and to alleviate the load imbalance. Prior 
efforts to reformulate dense matrix multiplication using a 
tasks-based programming models are known [^13; the nov¬ 
elty of our effort is the focus on dense matrice^tensors with 
irregular structure {i.e. block-rank-sparsity) that cannot be 
handled straightforwardly using the standard dense-only ap¬ 
proaches. 

In this section we briefly describe the design of our al- 














gorithm, by highlighting the differences with the procedural 
SUMMA implementations!^. We first analyze the data de¬ 
pendencies of discrete operations in the procedural SUMMA 
implementation. We then discuss the task composition and 
dependencies of our modified implementation. For simplic¬ 
ity, we only consider the 2D SUMMA implementation; though 
our approach is applicable to 3D and 2.5D variant of SUMMA. 

3.1 Control Flow of Standard SUMMA 

Like all dense MM algorithms, SUMMA consists of tightly 
synchronized data movement and computation (see Algo¬ 
rithm Namely, the rank-fc update, Cij ■(— AikBkj -\- Cij, 
of each iteration depends on the data from the broadcasts 
of the corresponding panels of A and B as well as the pre¬ 
vious iteration’s rank-fc update. In addition to these data 
dependencies, each broadcast is synchronized with a prior 
rank-fc update since communication operations are initiated 
at the beginning of each SUMMA iteration, as shown in 
Figure (we denote such sequence dependencies by dashed 
lines). Such a design ensures that only a minimal memory 
overhead occurs (although technically, nonblocking broad¬ 
casts require more memory than optimal). However, this de¬ 
sign also limits the work available to each processor (see the 
Figure [^, and therefore the amount of parallelism. Specif¬ 
ically, we can parallelize the rank-fc update of C as well as 
the column and row broadcasts of A and B, but SUMMA 
iterations — although almost independent from one another 
— are executed serially, one after the other. Furthermore, 
such design is not tolerant of any source of load imbalance, 
due to, for example, processor clock variation, network con¬ 
gestion, slack in communication, or — most important for 
us — due to the inhomogeneity of data from block size or 
rank variation. 


[ BCAST Aio] 

[ BCAST Poj] 


y 

[BCAST Ail] [Cij ■(-Aio 

■Boj+Cii] [bCASTPij] 



[BCAST Ai2] [Cij ^ Ail 

■Bij + Cij] [BCASTPaj] 



(BCAST Ais] [Cij ^ Ai2 

■B2j + Cij] [BCASTPaj] 

\^Cij ■<— Ai3 

Bsj + Cij ]] 


Figure 1: A directed acyclic graph of the procedural 
SUMMA implementation consisting of broadcast (BCAST) 
and rank-fc updates. Solid edges indicate data dependen¬ 
cies, and dashed edges indicate sequence dependencies. 


3.2 Task-Based Multiple-Issue SUMMA 

To tolerate the data inhomogeneity, whether due to block 
sparsity or block-rank sparsity and address the above de¬ 
ficiencies of the standard 2D SUMMA, we introduced the 
following modifications: (1) overdecomposition of data and 
work, (2) elimination of dependencies between SUMMA it¬ 
erations, (3) scheduling of multiple iterations of SUMMA at 
once, and (4) the use of task-decomposed broadcasts. 


Although the standard dense SUMMA allows near-perfect 
load balance of data by using cyclic embedding of matrices 
onto 2D process grid, it is not feasible to maintain such load 
balance with irregular data structures throughout the MM 
computation, even with uniform blocking. In fact the di¬ 
mension blocking is a central concept in the TiledArray 
library since it arises naturally in physical problems (hence 
the need to support arbitrary block sizes, including nonuni¬ 
form blocking). Thus we support an arbitrary blocking to 
match the physics of the problem, but assume that the data 
is overdecomposed (i.e. there are many blocks per proces¬ 
sor); this helps to improve load balance of the data. 

To improve the load balance of work, the artificial de¬ 
pendencies between SUMMA iterations are eliminated and 
multiple SUMMA iterations are scheduled at once. This 
increases the average number of computational tasks per 
processor, and thus improves the load balance under the as¬ 
sumption that the amount of work performed by each task is 
randomly distributed. Note that the dependencies between 
iterations is the result of GEMM operations (C = AB + C) 
via updates of the result matrix, C. To decouple this data 
dependency, we assume that there is enough memory avail¬ 
able to split the rank-fc update operation into two separate 
tasks: a matrix-multiplication task producing a temporary 
block, = Aik ■ Bkj, and a reduction of the temporary 

(k) 

into the result, Cij = -|- Cij. In our implementation, 

we mitigate the additional memory cost by automatically 
switching between GEMM and split matrix-multiply-and- 
reduces updates based on the data availability. Specifically, 
the GEMM update is used when only a single thread requires 
access to a sub-block and the split matrix-multiply and re¬ 
duces tasks are used when two or more threads must update 
the same sub-block. This update scheme is essential for high 
computational throughput and resource management in our 
task-based SUMMA formulation. 

Scheduling multiple SUMMA iterations at once impacts 
both performance and memory consumption. In that sense 
the multiple-issue SUMMA is similar to 2.5D and 3D MM al¬ 
gorithms 1^ that trade off memory to increase concurrency. 

The number of SUMMA iterations scheduled concurrently is 
a user-adjustable parameter and can be tuned to minimize 
memory consumption (see next Section) or to maximize the 
computational throughput. In practice, we found that the 
throughput is maximized when each process initiates /opt 
SUMMA iterations, where for a process grid with rows 
and Pc columns /opt = max(2, min(Pr, Pc)); this is the de¬ 
fault schedule depth, unless specified otherwise. As sched¬ 
uled SUMMA iterations are retired, additional iterations are 
scheduled, in a pipelined fashion. 

Lastly, data broadcasts are decomposed into several smaller 
tasks, which allows overlap of communication and compu¬ 
tation at the task level. For example, a given process can 
start the computational work for a given iteration as soon 
as the minimal amount of data is available, and execute 
concurrently with the remaining communication tasks. In¬ 
tegration of communication operations allows more efficient 
work scheduling based on the availability of data in our task 
approach and improves tolerance of irregular, high-latency 
communication. 

3.3 Memory Overhead of Multiple-Issue SUMMA 

The dynamic scheduling of computation and communica- 






























tion trades off predictable bounds on resource use, in partic¬ 
ular memory, for high performance. The maximum memory 
requirement per process of our multiple-issue SUMMA, for 
the multiplication of and R^^^ dense matrices with 

average block sizes of m x fc and k x n, respectively, is pro¬ 
portional to: 

fMk Nk\ MN + MK + KN 


where I is the number of concurrently-scheduled iterations; 
and Pi and Pc are the number of rows and columns in the 
process grid, respectively. Eq. 0 correspond to the memory 
requirements of the 2D, bulk-synchronous SUMMA when 
7=1. Our implementation also uses additional temporary 
storage for replicated sub-blocks of the result matrix (not 
shown in Eq. 0 ). but this is a negligible overhead in large 
problems where the number of blocks is much larger than 
the number of cores. If, for simplicity, we assume square 
matrices and process grid (M = A^ = 77, m = n = fc, 
Pi = Pc = y/P) then Eq. 0 reduces to 


2Nk 

W 



( 2 ) 


where I{2Nk/y/l^) is the memory overhead due to the par¬ 
tial replication of the argument matrices. Under the as¬ 
sumption of overdecomposition, the block size k is much 
smaller than N / y/P, hence the first term in Eq. 0 is much 
smaller than the second, even for 7 > 1. Note that Eq. 0 
is derived under the pessimistic assumption that the rate 
of computation is much lower than the rate of communica¬ 
tion. In practice, however, as soon as the data arrives it is 
consumed by the compute tasks, hence the effective value 
of 7 is lower than the actual number of SUMMA iterations 
scheduled. 

The average memory overhead for block-sparse SUMMA 
can be estimated by scaling each matrix, and correspond¬ 
ing replicated blocks, by the fraction of non-zero elements. 
In addition, some process may not contain non-zero data 
for a given iteration. Therefore, the memory overhead of 
the replicated blocks by {Pi)/Pi and {Pc)/Pc- With these 
modifications, Eq. 0 becomes: 


(1 - zc)MN -7(1- zk)MK -7(1- z^)KN 
'KPc 


(3) 


where 1 — Zx the fraction of non-zero elements for matrix X, 
and {Pi) and {Pc) are the expectation values for the number 
of row and column processes, respectively, with non-zero 
contributions in a single SUMMA iteration. 

Note that our implementation may also use less memory 
than that given in Eqs. 0 and By overdecompos¬ 

ing the rank-fc-update and broadcast tasks such that each 
matrix-multiplication task only computes a small sub-block 
of the local result matrix. This decomposition is similar to 
the decomposition of work between nodes within a SUMMA 
iteration, but without the analogous communication (depen¬ 
dencies) between computation tasks. It allows streaming of 
data so that processes do not need to hold all data for the 
block row and column in a given iteration. In addition, 
higher priority is given to tasks that free resources, which 
limits the accumulation of temporary storage space. Unfor¬ 


tunately, a priori prediction of the actual memory usage is 
not possible due to the non-deterministic order of execution. 

With block-rank-sparse matrices, used in our square root 
inverse algorithm (see Sections |4.1| and |4.2| for details), the 
exact memory requirements are similarly non-deterministic 
as the rank may grow or shrink based on the order of op¬ 
erations and nature of the input data. However, the upper 
memory bound is equal to that of the full-rank (dense) com¬ 
putation. Unfortunately, the rank of each block cannot eas¬ 
ily be determined until runtime, making a priori estimations 
of the memory requirements difficult. 

4. RESULTS 

In this section, we: (a) discuss the implementation de¬ 
tails of tensor arithmetic in low-rank form, (b) evaluate the 
performance of our methods with this low-rank data struc¬ 
ture by computing the inverse square root of matrices used 
in the quantum chemistry domain, and (c) demonstrate the 
performance of our implementation for dense matrix multi¬ 
plication. 

Our 2D task-based SUMMA algorithm is implemented in 
the Tiled Array (TA) library [^, which uses the paral¬ 
lel runtime of MADNESS to manage the low-level de¬ 
tails of task scheduling and data movement. Unless noted 
otherwise, we used TA 0.5.0-alpha, and MADNESS dated 
08/26/2015. Serial DGEMM from the Intel Math Kernel 
Library (MKL) 11.2.3 was used as the block multiply-add 
kernel in our MM tasks. TA and MADNESS were compiled 
with the Intel Parallel Studio XE 15.3 and Intel MPI 5.0. 
Both TA and MADNESS can be obtained under the terms 
of the GNU General Public License. 

These tests were performed on a 408-node Cray CS-300 
cluster, with two eight-core Intel Xeon E5-2670 CPUs and 
64 GB of memory per compute node. In addition, we per¬ 
formed dense MM scaling tests on the IBM BlueGene/Q 
Mira supercomputer at Argonne National Laboratory. 

4.1 Clustered Low-Rank Representation 

To recover data sparsity in tensors appearing in quantum 
physics applications we developed the Clustered Low-Rank 
(CLR) representation [23] that is a general, hierarchy-free 
compressed tensor format. In this representation, each Mxj- 
block of matrix M is approximated by a low-rank decompo¬ 
sition of the form Mij « XW^, where for a given 'M.ij € 
R™x" of rank r, X is m x r and W is n x r. X and W were 
constructed from a rank-revealing QR decomposition, [2^ 
Mij-P = QR. Thus X = Qr and where 

subscript r indicates that only r most significant columns 
and rows of Q and R were kept; r was determined so that 
the Frobenius norm of the low-rank approximation error, 
\\Mij - XW^IIf is less than or equal to eu, a user-defined 
threshold. For blocks where the rank of Mij- is greater than 
half of full rank, the blocks are stored in their full represen¬ 
tation. 

To maintain compression, all block operations, such as ad¬ 
dition and GEMM, are performed in low-rank form directly 
whenever possible. Low-rank matrix multiplication of C = 
AB, where all matrices appear in low-rank form, uses the 
following steps. First, create a temporary Z = (W'^)^X®, 
where the superscript letters designate the matrix that X 
and W were constructed from, leaving 

X*^(W*^)''' = X^Z(W®)^ 


(4) 
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Figure 2: Wall time for computing the square root inverse 
of the Coulomb matrix. 1 — 1 refers to data obtained with 
the standard single-issue SUMMA; the rest of data obtained 
with the multiple-issue SUMMA. 


Figure 3: Wall time for computing the square root inverse 
of the overlap matrix. 7=1 refers to data obtained with 
the standard single-issue SUMMA; the rest of data obtained 
with the multiple-issue SUMMA. 


Then contract Z with or W® such that the output rank 
is minimized; contracting Z with X'^ yields X*^ = X'^Z and 
W<= ^ w®. 

Low-rank addition of C = A + B may be viewed as a sum 
of outer products: 

■TA ’’B 

X*^(W*^)'''^x®*(wf)^ (5) 

i=l 3=1 

where xf", wf", x®, and w® represent column vectors of the 
corresponding matrix and ta and vb are the ranks of A and 
B respectively. From Eq. ([^, it is clear that the low-rank 
matrices representing C are the union of the input matrices, 
with X*^ = {X^X®} and = {W^W®}. 

Finally, in order to compress a matrix that is already in 
a low-rank form C = X(W)^ to a more compressed form 
C = X'(W')^ we first perform a QR decomposition of X 
and W 

X'(W')''' = (6) 

and then form a temporary matrix M = R^(R''^)^ giving 

US 

X'(W')''' = Q^M(Q’^)1'. (7) 


tiplicativity of the Frobenius norm, e.g. \\Aij -f Hij\\f < 
IIAulli? -I- ||Bij-||i?, hence the right-hand side of the in¬ 
equality is used as the estimate. Complete details of screen¬ 
ing the arithmetic operations are provided in Ref. |23| . 

To summarize: the accuracy of CLR representation and 
arithmetic is controlled by two user-defined parameters, eir 
and Esp. As e\r —> 0, the CLR representation becomes exact; 
similarly, when Esp —^ 0 the arithmetic on CLR matrices 
becomes exact. 


4.2 Iterative Square Root Inverse 

To compute the inverse square root of a matrix M, where 
the matrix square root is given by the matrix such that 

M1/2mi/ 2 ^ M and we used the iterative 

matrix-mulitplication approach in based on the Newton- 
Schulz method 18 25 consisting of the following steps in 
each iteration: 


X„ — qY„Z„, (8) 

T„ = i (-10X„ -b 3X^ -b 151) , (9) 

O 

Zn + l = Z„T„, (10) 

Y„+i = T„Y„, (11) 


We can perform a low-rank decomposition of M = X'^ (W'^)^ 
creating a new compressed representation of the original 
block where X' = Q^X’^ and W' = Q’^W’^. After com¬ 
pression, the rank of C has been decreased from ta+tb to 
the rank of M. 

Arithmetic on CLR matrices is implemented in terms of 
the above low-rank block arithmetic. To avoid comput¬ 
ing blocks whose norm will be smaller than target preci¬ 
sion, block Cij of result matrix C is only computed if its 
Frobenius norm estimate satisfies HCuHf < esparea(Cu), 
where area(Cij') is the number of elements in Cxj, and Esp 
is a user-defined parameter. Estimated of Frobenius norms 
of sums and products of blocks are estimated using the up¬ 
per bounds provided by the triangle inequality and submul- 


where I is the identity matrix, and upon convergence yAlZ = 
is the sought square root inverse. The starting guesses 
are Zo = I, Yo = M, and a is chosen to scale M such that 
||aM - I|j2 < 1. 

Square root inverses of the overlap and Coulomb oper¬ 
ator matrices, S and V respectively [^, were computed 
for a cluster of 190 water molecules with the cc-pVTZ-RI 
Gaussian basis, consisting of 141 basis functions per water 
molecule, for a total basis size of 26,790 basis functions. The 
matrices use natural blocking, i.e. each 141 by 141 block 
spans basis functions associated with the corresponding wa¬ 
ter molecule. Esp = 10“^^ and ei^ = 10“® were used. 

Our performance tests executed 10 iterations of this algo¬ 
rithm; the resulting wall times are reported in Figs, [^and 
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Block Rank 

Figure 4: Distribution of block ranks of the intermediate 
matrices from the 5th iteration of the inverse square root 
computation of the overlap matrix. A rank of “0” indicates 
a block contains no data, and “141” indicates full rank. 


Good strong scaling is observed: for example, for the 
Coulomb matrix in dense and low-rank representations the 
parallel efficiency is maintained at 43% and 59%, respec¬ 
tively, upon scaling from 1 to 128 nodes. For the overlap 
matrix, the corresponding figures are 28% and 31%. The ob¬ 
served differences in scaling between overlap and Coulomb 
are likely due to the much faster (exponential) decay of over¬ 
lap matrix with distance and hence greatly reduced amount 
of work due to more efficient screening of matrix operations: 
without such screening the time to compute the inverses of 
Coulomb and overlap matrices in full-rank representation 
would take exactly the same amount of time, whereas in 
practice the inversion of the overlap is roughly a factor of 2 
faster than that of the Coulomb matrix in full-rank form. 

Unlike S, V does not have any block-sparsity, only block- 
rank-sparsity. Hence computing in the full-rank form 

essentially performs the same amount of work as would an 
implementation using a standard dense matrix package. The 
apparent performance of the full-rank inverse was estimated 
by counting FLOPs from 4 GEMMs per each iteration; the 
resulting performance on 1 node for the Coulomb case is 263 
GLOPs, or 79% of theoretical peak, which is close to opti¬ 
mal. On 128 nodes the apparent throughput is respectable 
99 GLOPs. 

The benefit of the low-rank representation is immediately 
apparent for the overlap matrix on 1 node, whereas for the 
Coulomb matrix the low-rank implementation outperforms 
the full-rank counterparts when the number of processors is 
large, due to the greatly reduced amount of communication 
in the low-rank case. Also obvious is the importance of 
scheduling multiple SUMMA iterations for performance in 
the high-end strong scaling regime for both dense and low- 
rank matrices. For example, on 128 nodes the use of single¬ 
issue SUMMA when computing 8“^^^ in low-rank form is 
97% slower than multiple-issue SUMMA. 

Figure [^demonstrates the extent of the data inhomogene¬ 
ity in the matrices involved in these computation. It shows 
a histogram of block ranks of 5th iteration intermediates X, 
Y, Z, T, as well as the input S matrix. Most blocks in these 
matrices have very low ranks. However, a significant number 
of blocks have close-to-full rank, which is typical behavior 
for matrices with decay. 



Number of Cores 

Figure 5: Wall time of square dense matrix multiplication 
with Tiled Array, CTF, and ScaLAPACK. Matrix size = 
32,768. 


4.3 Dense Matrix Multiplication 

In the limit where the low-rank threshold and sparsity 
thresholds become zero, i.e. en —> 0 and Csp —>■ 0, CLR 
matrices become full-rank (dense) matrices. In addition, 
the features of our approach that make it appropriate for 
rank-structured matrix multiplication also lead to excellent 
performance for dense matrices. Thus, in this section we 
evaluate the performance of our SUMMA formulation in 
TiledArray (TA) [^ for the case of dense, full-rank ma¬ 
trices in absolute terms and relative to that of ScaLAPACK 
1^ and Cyclops Tensor Framework (CTF) 1.1 [^ sT 


a state-of-the-art, dense tensor contraction implementation 
based on 2.5D SUMMA. Performance was evaluated in two 
prototypical setups: on a commodity cluster and on a high- 
end IBM BlueCene/Q Mira supercomputer at Argonne Na¬ 
tional Laboratory. 

On the commodity cluster, we used the version of ScaLA¬ 
PACK provided by Intel MKL 11.2.3. CTF was compiled 
with the same Intel compiler described at the beginning of 
this section. Unlike TA, however, ScaLAPACK and CTF 
use the threaded version of BLAS DGEMM routine from 
MKL for their multiply-add kernels. 

Due to the constraints of the existing parallel dense MM 
software and to simplify the performance analysis, we per¬ 
form multiplications of square, uniformly-blocked, double¬ 
precision matrices; our implementation is completely gen¬ 
eral. The matrix size used in the tests on the commodity 
cluster was 32,768, with block size of 256 used for TiledAr¬ 
ray. Reported wall times are averages of 15 repeated mul¬ 
tiplications of same input matrices. 

In Figurej^ we show the result of our strong scaling tests, 
which vary from 16 to 1024 cores. Each of the MM soft¬ 
ware packages shows linear scaling across the range of the 
tests. However, of the three packages, TA achieved the low¬ 
est computational time in all cases. In fact, the performance 
differences between TA, ScaLAPACK, and CTF are signif¬ 
icant. On 64 compute nodes, ScaLAPACK took between 
1.36 ~ 2.50 times longer to complete relative to TA, while 










































CTF took 1.25 ~ 1.59 times longer to complete. The paral¬ 
lel efficiency of TA MM relative to 1 node is between 84.3% 
and 63.7% on 64 nodes. 

We also studied the performance of TA with the number 
of concurrent iterations, /, set to one to demonstrate the 
performance benefits of multiple-issue SUMMA iterations. 
As expected, the performance of TA with 1=1 was approx¬ 
imately the same on a single node, but performs much more 
slowly with larger node counts. Scaling with a single-issue 
(standard) SUMMA is linear, but substantially slower than 
that of multiple-issue SUMMA, and is consistent with the 
square root inverse performance tests in Section [4.2| 


sparsity. An implementation of our algorithm in an open- 
source library TiledArray was used to compute iteratively 
square-root inverses of two realistic matrices from the do¬ 
main of quantum chemistry, using block-sparse and block- 
rank-sparse representations. Excellent strong scaling from 
16 to 2048 cores was observed on a commodity cluster. For 
dense matrices our implementation is competitive with state- 
of-the-art dense tensor algebra libraries like Cyclops Ten¬ 
sor Framework, both on a commodity cluster using up to 
1024 cores as well as on a high-end IBM BG/Q supercom¬ 
puter using up to 262 thousand cores. 
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Figure 6: Wall time of square dense matrix multiplication 
with TiledArray, CTF, and ScaLAPACK on IBM Blue- 
Gene/Q. The matrix sizes included in these tests are 32,768, 
65,536, 98,304, and 256,000. 


Finally, we evaluated the performance of our SUMMA 
implementation on the IBM BlueGene/Q Mira supercom¬ 
puter at Argonne National Laboratory. In these perfor¬ 
mance tests, we used TA 0.4.0-alpha, MADNESS dated 
03/02/2015, ScaLAPAGK 2.0.2, and CTF 1.1. The parallel 
version of ESSL 5.1 was used for the CTF and ScaLAPACK 
performance tests, while the serial version is used for TA. 
TA and MADNESS were compiled with GCC 4.7.2; CTF is 
compiled with XLC 12.1. Both used the V1R2M2 compiler 
driver. 

In Figure we show the performance of TA, CTF, and 
ScaLAPACK with matrices of various sizes. TiledArray 
used 128 X 128 blocks. The number of cores varies from 
512 to 262144. Both TA and CTF are found to be competi¬ 
tive with each other, with excellent strong and weak scaling, 
though each behaves differently across the range of the tests. 
We found ScaLAPACK, on the other hand, to scale very 
poorly relative to the TA and CTF, with almost an order of 
magnitude difference in computational time on 16384 cores. 


5. CONCLUSIONS 

We presented a task-based multiple-issue formulation of 
the Scalable Universal Matrix Multiplication Algorithm (SU¬ 
MMA) that performs well for multiplication of matrices with 
irregular structure, such as block-sparsity and block-rank- 
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